arXiv:1505.04123v1 [cs.LG] 15 May 2015


Margins, Kernels and Non-linear Smoothed Perceptrons 


Aaditya Ramdas 
Machine Learning Department 
Carnegie Mellon University 

aramdas@cs.cmu.edu


Javier Peña

Tepper School of Business 
Carnegie Mellon University 

jfp@andrew.cmu.edu


May 18, 2015 


Abstract 

We focus on the problem of finding a non-linear classification function that lies in a Reproducing Kernel 
Hilbert Space (RKHS) both from the primal point of view (finding a perfect separator when one exists) and 
the dual point of view (giving a certificate of non-existence), with special focus on generalizations of two 
classical schemes - the Perceptron (primal) and Von-Neumann (dual) algorithms. 

We cast our problem as one of maximizing the regularized normalized hard-margin ($\rho$) in an RKHS and
rephrase it in terms of a Mahalanobis dot-product/semi-norm associated with the kernel's (normalized and
signed) Gram matrix. We derive an accelerated smoothed algorithm with a convergence rate of $\sqrt{\log n}/\rho$ given
$n$ separable points, which is strikingly similar to the classical kernelized Perceptron algorithm whose rate
is $1/\rho^2$. When no such classifier exists, we prove a version of Gordan's separation theorem for RKHSs, and
give a reinterpretation of negative margins. This allows us to give guarantees for a primal-dual algorithm
that halts in $\min\{\sqrt{n}/|\rho|, \sqrt{n}/\epsilon\}$ iterations with a perfect separator in the RKHS if the primal is feasible or a
dual $\epsilon$-certificate of near-infeasibility.


1 Introduction 

We are interested in the problem of finding a non-linear separator for a given set of $n$ points $x_1, \dots, x_n \in \mathbb{R}^d$
with labels $y_1, \dots, y_n \in \{\pm 1\}$. Finding a linear separator can be stated as the problem of finding a unit vector
$w$ (if one exists) such that for all $i$

$$y_i (w^\top x_i) > 0 \quad \text{i.e.} \quad \mathrm{sign}(w^\top x_i) = y_i. \qquad (1)$$

This is called the primal problem. In the more interesting non-linear setting, we will be searching for functions
$f$ in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{F}_K$ associated with kernel $K$ (to be defined later) such
that for all $i$

$$y_i f(x_i) > 0. \qquad (2)$$

We say that problems (1), (2) have an unnormalized margin $\rho > 0$, if there exists a unit vector $w$, such that
for all $i$,

$$y_i (w^\top x_i) \ge \rho \quad \text{or} \quad y_i f(x_i) \ge \rho.$$

True to the paper’s title, margins of non-linear separators in an RKHS will be a central concept, and we 
will derive interesting smoothed accelerated variants of the Perceptron algorithm that have convergence rates 
(for the aforementioned primal and a dual problem introduced later) that are inversely proportional to the 
RKHS-margin as opposed to inverse squared margin for the Perceptron. 

The linear setting is well known by the name of linear feasibility problems - we are asking if there exists
any vector $w$ which makes an acute angle with all the vectors $y_i x_i$, i.e.

$$(XY)^\top w > 0_n, \qquad (3)$$

where $Y := \mathrm{diag}(y)$, $X := [x_1, \dots, x_n]$. This can be seen as finding a vector $w$ inside the dual cone of
$\mathrm{cone}\{y_i x_i\}$.

When normalized, as we will see in the next section, the margin is a well-studied notion of conditioning
for these problems. It can be thought of as the width of the feasibility cone as in Freund & Vera (1999), a
radius of well-posedness as in Cheung & Cucker (2001), and its inverse can be seen as a special case of a
condition number defined by Renegar (1995) for these systems.


1.1 Related Work 


In this paper we focus on the famous Perceptron algorithm from Rosenblatt (1958) and the less-famous
Von-Neumann algorithm from Dantzig (1992) that we introduce in later sections. As mentioned by Epelman & Freund
(2000), in a technical report by the same name, Nesterov pointed out in a note to the authors that the latter is
a special case of the now-popular Frank-Wolfe algorithm.

Our work builds on Soheili & Peña (2012, 2013b) from the field of optimization - we generalize the
setting to learning functions in RKHSs, extend the algorithms, simplify proofs, and simultaneously bring new 
perspectives to it. There is extensive literature around the Perceptron algorithm in the learning community; we 
restrict ourselves to discussing only a few directly related papers, in order to point out the several differences 
from existing work. 

We provide a general unified proof in the Appendix which borrows ideas from accelerated smoothing
methods developed by Nesterov (2005) - while this algorithm and others by Nemirovski (2004), Saha et al.
(2011) can achieve similar rates for the same problem, those algorithms do not possess the simplicity of
the Perceptron or Von-Neumann algorithms and our variants, and also don't look at the infeasible setting or
primal-dual algorithms.


Accelerated smoothing techniques have also been seen in the learning literature like in Tseng (2008) and
many others. However, most of these deal with convex-concave problems where both sets involved are the
probability simplex (as in game theory, boosting, etc), while we deal with hard margins where one of the
sets is a unit $\ell_2$ ball. Hence, their algorithms/results are not extendable to ours trivially. This work is also
connected to the idea of $\epsilon$-coresets by Clarkson (2010), though we will not explore that angle.

A related algorithm is called the Winnow by Littlestone (1991) - this works on the $\ell_1$ margin and is a
saddle point problem over two simplices. One can ask whether such accelerated smoothed versions exist for
the Winnow. The answer is in the affirmative - however such algorithms look completely different from the 
Winnow, while in our setting the new algorithms retain the simplicity of the Perceptron. 


1.2 Paper Outline 

Sec.2 will introduce the Perceptron and Normalized Perceptron algorithm and their convergence guarantees 
for linear separability, with specific emphasis on the unnormalized and normalized margins. Sec.3 will then 
introduce RKHSs and the Normalized Kernel Perceptron algorithm, which we interpret as a subgradient 
algorithm for a regularized normalized hard-margin loss function. 

Sec.4 describes the Smoothed Normalized Kernel Perceptron algorithm that works with a smooth approximation
to the original loss function, and outlines the argument for its faster convergence rate. Sec.5 discusses
the non-separable case and the Von-Neumann algorithm, and we prove a version of Gordan’s theorem in 
RKHSs. 

We finally give an algorithm in Sec.6 which terminates with a separator if one exists, and with a dual 
certificate of near-infeasibility otherwise, in time inversely proportional to the margin. Sec.7 has a discussion 
and some open problems. 

2 Linear Feasibility Problems

2.1 Perceptron 

The classical perceptron algorithm can be stated in many ways; one of them is the following form.


Algorithm 1 Perceptron
    Initialize $w_0 = 0$
    for $k = 0, 1, 2, 3, \dots$ do
        if $\mathrm{sign}(w_k^\top x_i) \ne y_i$ for some $i$ then
            $w_{k+1} := w_k + y_i x_i$
        else
            Halt; Return $w_k$ as solution
        end if
    end for

It comes with the following classic guarantee as proved by Block (1962) and Novikoff (1962): If there
exists a unit vector $u \in \mathbb{R}^d$ such that $Y X^\top u \ge \rho > 0$, then a perfect separator will be found in
$\max_i \|x_i\|_2^2 / \rho^2$ iterations/mistakes.
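
The mistake bound above can be checked numerically. The following is a minimal sketch, not the authors' code: the function name `perceptron`, the row-wise data layout (points as rows rather than columns) and the toy data are assumptions made for illustration. It runs Algorithm 1 and counts updates.

```python
import numpy as np

def perceptron(X, y, max_iter=100000):
    """Classical Perceptron (Algorithm 1). X has shape (n, d), rows are points."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_iter):
        # A point is misclassified when sign(w^T x_i) != y_i, i.e. y_i w^T x_i <= 0.
        bad = np.nonzero(y * (X @ w) <= 0)[0]
        if len(bad) == 0:
            return w, mistakes          # perfect separator found
        i = bad[0]                      # any misclassified point may be used
        w = w + y[i] * X[i]
        mistakes += 1
    return None, mistakes               # no separator found within max_iter

# Hypothetical toy data, separable by the unit vector u = (1, 1)/sqrt(2).
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, mistakes = perceptron(X, y)
```

On this toy data, the margin achieved by $u = (1,1)/\sqrt{2}$ gives a bound of $\max_i \|x_i\|_2^2/\rho^2 = 5/4.5$, so at most one mistake is allowed.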

The algorithm works when updated with any arbitrary point $(x_i, y_i)$ that is misclassified; it has the same
guarantees when $w$ is updated with the point that is misclassified by the largest amount, $\arg\min_i y_i w^\top x_i$.
Alternately, one can define the probability distribution over examples

$$p(w) = \arg\min_{p \in \Delta_n} \langle Y X^\top w, p \rangle, \qquad (4)$$

where $\Delta_n$ is the $n$-dimensional probability simplex.

Intuitively, p picks the examples that have the lowest margin when classified by w. One can also normalize 
the updates so that we can maintain a probability distribution over examples used for updates from the start, 
as seen below: 


Algorithm 2 Normalized Perceptron
    Initialize $w_0 = 0$, $p_0 = 0$
    for $k = 0, 1, 2, 3, \dots$ do
        if $Y X^\top w_k > 0$ then
            Exit, with $w_k$ as solution
        else
            $\theta_k := \frac{1}{k+1}$
            $w_{k+1} := (1 - \theta_k) w_k + \theta_k X Y p(w_k)$
        end if
    end for


Remark. Normalized Perceptron has the same guarantees as perceptron - the Perceptron can perform its 
update online on any misclassified point, while the Normalized Perceptron performs updates on the most 
misclassified point(s), and yet there does not seem to be any change in performance. However, we will soon 
see that the ability to see all the examples at once gives us much more power. 

2.2 Normalized Margins 

If we normalize the data points by the $\ell_2$ norm, the resulting mistake bound of the perceptron algorithm
is slightly different. Let $X_2$ represent the matrix with columns $x_i/\|x_i\|_2$. Define the unnormalized and
normalized margins as

$$\rho := \sup_{\|w\|_2 = 1} \inf_{p \in \Delta_n} \langle Y X^\top w, p \rangle,$$

$$\rho_2 := \sup_{\|w\|_2 = 1} \inf_{p \in \Delta_n} \langle Y X_2^\top w, p \rangle.$$

Remark. Note that we have $\sup_{\|w\|_2 = 1}$ in the definition; this is equivalent to $\sup_{\|w\|_2 \le 1}$ iff $\rho_2 > 0$.

Normalized Perceptron has the following guarantee on $X_2$: If $\rho_2 > 0$, then it finds a perfect separator in
$1/\rho_2^2$ iterations.


Remark. Consider the max-margin separator $u^*$ for $X$ (which is also a valid perfect separator for $X_2$).
Then

$$\frac{\rho}{\max_i \|x_i\|_2} = \min_i \frac{y_i x_i^\top u^*}{\max_i \|x_i\|_2} \le \min_i \frac{y_i x_i^\top u^*}{\|x_i\|_2} \le \sup_{\|u\|_2 = 1} \min_i \frac{y_i x_i^\top u}{\|x_i\|_2} = \rho_2.$$

Hence, it is always better to normalize the data as pointed out in Graepel et al. (2001). This idea extends to
RKHSs, motivating the normalized Gram matrix considered later.

Example. Consider a simple example in $\mathbb{R}^2$. Assume that $+$ points are located along the line $6x_2 = 8x_1$,
and the $-$ points along $8x_2 = 6x_1$, for $1/r \le \|x\|_2 \le r$, where $r > 1$. The max-margin linear separator
will be $x_1 = x_2$. If all the data were normalized to have unit Euclidean norm, then all the $+$ points would
be at $(0.6, 0.8)$ and all the $-$ points at $(0.8, 0.6)$, giving us a normalized margin of $\rho_2 \approx 0.14$. Unnormalized,
the margin is $\rho \approx 0.14/r$ and $\max_i \|x_i\|_2 = r$. Hence, in terms of bounds, we get a discrepancy of $r^4$, which
can be arbitrarily large.
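
The numbers in this example can be reproduced with a few lines of NumPy. This is only an illustrative sketch: the function name `example_margins` and the choice of three sample radii per class are assumptions, and the separator $x_1 = x_2$ is taken from the example rather than computed.

```python
import numpy as np

def example_margins(r):
    """Margins for the two-ray example, sampled at radii 1/r, 1 and r (an assumption)."""
    radii = np.array([1.0 / r, 1.0, r])
    pos = np.outer(radii, [0.6, 0.8])          # + points on the line 6 x2 = 8 x1
    neg = np.outer(radii, [0.8, 0.6])          # - points on the line 8 x2 = 6 x1
    X = np.vstack([pos, neg])                  # rows are points
    y = np.array([1.0] * 3 + [-1.0] * 3)
    u = np.array([-1.0, 1.0]) / np.sqrt(2.0)   # unit normal of the separator x1 = x2
    norms = np.linalg.norm(X, axis=1)
    rho = np.min(y * (X @ u))                  # unnormalized margin, about 0.14 / r
    rho2 = np.min(y * (X @ u) / norms)         # normalized margin, about 0.14
    return rho, rho2, norms.max()

rho, rho2, R = example_margins(3.0)
# Ratio of the two mistake bounds: (R^2 / rho^2) / (1 / rho2^2) = r^4.
ratio = (R ** 2 / rho ** 2) * rho2 ** 2
```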

Winnow. The question arises as to which norm we should normalize by. There is a now classic algorithm
in machine learning, called Winnow by Littlestone (1991) or Multiplicative Weights. It works on a slight
transformation of the problem where we only need to search for $u \in \mathbb{R}^d_+$. It comes with some very
well-known guarantees - If there exists a $u \in \mathbb{R}^d_+$ such that $Y X^\top u \ge \rho > 0$, then feasibility is guaranteed in
$\|u\|_1^2 \max_i \|x_i\|_\infty^2 \log n / \rho^2$ iterations. The appropriate notion of normalized margin here is

$$\rho_1 := \max_{w \in \Delta_d} \min_{p \in \Delta_n} \langle Y X_\infty^\top w, p \rangle,$$

where $X_\infty$ is a matrix with columns $x_i/\|x_i\|_\infty$. Then, the appropriate iteration bound is $\log n/\rho_1^2$. We will
return to this $\ell_1$-margin in the discussion section. In the next section, we will normalize by using the kernel
appropriately.


3 Kernels and RKHSs 


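The theory of Reproducing Kernel Hilbert Spaces (RKHSs) has a rich history, and for a detailed introduction,
refer to Schölkopf & Smola (2002). Let $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a symmetric positive definite kernel, giving
rise to a Reproducing Kernel Hilbert Space $\mathcal{F}_K$ with an associated feature mapping at each point $x \in \mathbb{R}^d$
called $\phi_x : \mathbb{R}^d \to \mathcal{F}_K$ where $\phi_x(\cdot) = K(x, \cdot)$ i.e. $\phi_x(y) = K(x, y)$. $\mathcal{F}_K$ has an associated inner product
$\langle \phi_u, \phi_v \rangle_K = K(u, v)$. For any $f \in \mathcal{F}_K$, we have $f(x) = \langle f, \phi_x \rangle_K$.
Define the normalized feature map

$$\tilde\phi_x = \frac{\phi_x}{\sqrt{K(x,x)}} \in \mathcal{F}_K \quad \text{and} \quad \tilde\phi_X := [\tilde\phi_{x_i}]_{i=1}^n.$$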
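For any function $f \in \mathcal{F}_K$, we use the following notation

$$Y \tilde f(X) := \langle f, Y \tilde\phi_X \rangle_K = \left[ y_i \langle f, \tilde\phi_{x_i} \rangle_K \right]_{i=1}^n = \left[ \frac{y_i f(x_i)}{\sqrt{K(x_i, x_i)}} \right]_{i=1}^n.$$

We analogously define the normalized margin here to be

$$\rho_K := \sup_{\|f\|_K = 1} \inf_{p \in \Delta_n} \langle Y \tilde f(X), p \rangle. \qquad (5)$$

Consider the following regularized empirical loss function

$$L(f) = \sup_{p \in \Delta_n} \langle -Y \tilde f(X), p \rangle + \tfrac{1}{2} \|f\|_K^2. \qquad (6)$$

Denoting $t := \|f\|_K > 0$ and writing $f = t \frac{f}{\|f\|_K} = t \bar f$, let us calculate the minimum value of this
function

$$\inf_{f \in \mathcal{F}_K} L(f) = \inf_{t > 0} \inf_{\|\bar f\|_K = 1} \sup_{p \in \Delta_n} \left\langle -\langle t \bar f, Y \tilde\phi_X \rangle_K, p \right\rangle + \tfrac{1}{2} t^2$$
$$= \inf_{t > 0} \left\{ -t \rho_K + \tfrac{1}{2} t^2 \right\}$$
$$= -\tfrac{1}{2} \rho_K^2 \quad \text{when } t = \rho_K > 0. \qquad (7)$$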


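Since $\max_{p \in \Delta_n} \langle -Y \tilde f(X), p \rangle$ is some empirical loss function on the data and $\tfrac{1}{2}\|f\|_K^2$ is an increasing
function of $\|f\|_K$, the Representer Theorem (Schölkopf et al., 2001) implies that the minimizer of the above
function lies in the span of the $\phi_{x_i}$s (also the span of the $y_i \tilde\phi_{x_i}$s). Explicitly,

$$\arg\min_{f \in \mathcal{F}_K} L(f) = \sum_{i=1}^n \alpha_i y_i \tilde\phi_{x_i} = \langle Y \tilde\phi_X, \alpha \rangle. \qquad (8)$$

Substituting this back into Eq. (6), we can define

$$L(\alpha) := \sup_{p \in \Delta_n} \langle -\alpha, p \rangle_G + \tfrac{1}{2} \|\alpha\|_G^2, \qquad (9)$$

where $G$ is a normalized signed Gram matrix with $G_{ii} = 1$,

$$G_{ji} = G_{ij} := \frac{y_i y_j K(x_i, x_j)}{\sqrt{K(x_i, x_i) K(x_j, x_j)}} = \langle y_i \tilde\phi_{x_i}, y_j \tilde\phi_{x_j} \rangle_K,$$

and $\langle p, \alpha \rangle_G := p^\top G \alpha$, $\|\alpha\|_G := \sqrt{\alpha^\top G \alpha}$. One can verify that $G$ is a PSD matrix and the $G$-norm $\|\cdot\|_G$ is a
semi-norm, whose properties are of great importance to us.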
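
As an illustration of these definitions, the normalized signed Gram matrix and the $G$-semi-norm can be computed as below. This is a sketch under stated assumptions: the helper names `rbf_kernel`, `signed_normalized_gram` and `g_norm`, the Gaussian kernel and the toy data are all hypothetical choices, not part of the paper.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel, used here only as an example of a positive definite K."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def signed_normalized_gram(X, y, kernel):
    """G_ij = y_i y_j K(x_i, x_j) / sqrt(K(x_i, x_i) K(x_j, x_j)); rows of X are points."""
    n = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    d = np.sqrt(np.diag(K))
    return (y[:, None] * y[None, :]) * K / (d[:, None] * d[None, :])

def g_norm(G, a):
    """Semi-norm ||a||_G = sqrt(a^T G a); the max guards against tiny negative round-off."""
    return np.sqrt(max(a @ G @ a, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, -1.0, 1.0])
G = signed_normalized_gram(X, y, rbf_kernel)
a = np.array([0.5, -1.0, 2.0])
```

The matrix `G` has unit diagonal and is symmetric PSD, and `g_norm(G, a)` never exceeds the 1-norm of `a`, in line with Lemma 2.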


3.1 Some Interesting and Useful Lemmas 

The first lemma justifies our algorithms’ exit condition. 

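Lemma 1. $L(\alpha) < 0$ implies $G\alpha > 0$, and there exists a perfect classifier iff $G\alpha > 0$.

Proof. $L(\alpha) < 0 \Rightarrow \sup_{p \in \Delta_n} \langle -G\alpha, p \rangle < 0 \Leftrightarrow G\alpha > 0$. $G\alpha > 0 \Rightarrow f_\alpha := \langle \alpha, Y \tilde\phi_X \rangle$ is perfect since

$$\frac{y_j f_\alpha(x_j)}{\sqrt{K(x_j, x_j)}} = \sum_{i=1}^n \alpha_i \frac{y_i y_j K(x_i, x_j)}{\sqrt{K(x_i, x_i) K(x_j, x_j)}} = G_j \alpha > 0.$$

If a perfect classifier exists, then $\rho_K > 0$ by definition and

$$L(f^*) = L(\alpha^*) = -\tfrac{1}{2} \rho_K^2 < 0 \Rightarrow G\alpha^* > 0,$$

where $f^*, \alpha^*$ are the optimizers of $L(f)$, $L(\alpha)$. □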
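
The second lemma bounds the $G$-norm of vectors.

Lemma 2. For any $\alpha \in \mathbb{R}^n$, $\|\alpha\|_G \le \|\alpha\|_1 \le \sqrt{n} \|\alpha\|_2$.

Proof. Using the triangle inequality of norms, we get

$$\sqrt{\alpha^\top G \alpha} = \sqrt{\left\langle \langle \alpha, Y \tilde\phi_X \rangle, \langle \alpha, Y \tilde\phi_X \rangle \right\rangle_K} = \Big\| \sum_i \alpha_i y_i \tilde\phi_{x_i} \Big\|_K \le \sum_i \| \alpha_i y_i \tilde\phi_{x_i} \|_K \le \sum_i |\alpha_i| \left\| \frac{y_i \phi_{x_i}}{\sqrt{K(x_i, x_i)}} \right\|_K = \|\alpha\|_1,$$

where we used $\langle \phi_{x_i}, \phi_{x_i} \rangle_K = K(x_i, x_i)$. □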

The third lemma gives a new perspective on the margin. 

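Lemma 3. When $\rho_K > 0$, $f$ maximizes the margin iff $\rho_K f$ optimizes $L(f)$. Hence, the margin is equivalently

$$\rho_K = \sup_{\|\alpha\|_G = 1} \inf_{p \in \Delta_n} \langle \alpha, p \rangle_G \le \|p\|_G \quad \text{for all } p \in \Delta_n.$$

Proof. Let $f_\rho$ be any function with $\|f_\rho\|_K = 1$ that achieves the max-margin $\rho_K > 0$. Then, it is easy to
plug $\rho_K f_\rho$ into Eq. (6) and verify that $L(\rho_K f_\rho) = -\tfrac{1}{2} \rho_K^2$ and hence $\rho_K f_\rho$ minimizes $L(f)$.

Similarly, let $f_L$ be any function that minimizes $L(f)$, i.e. achieves the value $L(f_L) = -\tfrac{1}{2} \rho_K^2$. Defining
$t := \|f_L\|_K$ and examining Eq. (7), we see that $L(f_L)$ cannot achieve the value $-\tfrac{1}{2} \rho_K^2$ unless $t = \rho_K$ and
$\sup_{p \in \Delta_n} \langle -Y \tilde f_L(X), p \rangle = -\rho_K^2$, which means that $f_L/\rho_K$ must achieve the max-margin.

Hence considering only $f = \sum_i \alpha_i y_i \tilde\phi_{x_i}$ is acceptable for both. Plugging this into Eq. (5) gives the
equality and

$$\rho_K = \inf_{p \in \Delta_n} \sup_{\|\alpha\|_G = 1} \langle \alpha, p \rangle_G \le \sup_{\|\alpha\|_G = 1} \langle \alpha, p \rangle_G \le \|p\|_G \quad \text{by applying Cauchy-Schwarz}$$

(can also be seen by going back to function space). □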


4 Smoothed Normalized Kernel Perceptron 

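Define the distribution over the worst-classified points

$$p(f) := \arg\min_{p \in \Delta_n} \langle Y \tilde f(X), p \rangle \quad \text{or} \quad p(\alpha) := \arg\min_{p \in \Delta_n} \langle \alpha, p \rangle_G. \qquad (10)$$

Implicitly,

$$f_{k+1} = (1 - \theta_k) f_k + \theta_k \langle Y \tilde\phi_X, p(f_k) \rangle$$
$$= f_k - \theta_k \left( f_k - \langle Y \tilde\phi_X, p(f_k) \rangle \right)$$
$$= f_k - \theta_k \partial L(f_k)$$

and hence the Normalized Kernel Perceptron (NKP) is a subgradient algorithm to minimize $L(f)$ from Eq. (6).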

Algorithm 3 Normalized Kernel Perceptron (NKP)
    Set $\alpha_0 := 0$
    for $k = 0, 1, 2, 3, \dots$ do
        if $G\alpha_k > 0_n$ then
            Exit, with $\alpha_k$ as solution
        else
            $\theta_k := \frac{1}{k+1}$
            $\alpha_{k+1} := (1 - \theta_k) \alpha_k + \theta_k p(\alpha_k)$
        end if
    end for

Remark. Lemma 3 yields deep insights. Since NKP can get arbitrarily close to the minimizer of strongly
convex $L(f)$, it also gets arbitrarily close to a margin maximizer. It is known that it finds a perfect classifier
in $1/\rho_K^2$ iterations - we now additionally infer that it will continue to improve to find an approximate
max-margin classifier. While both classical and normalized Perceptrons find perfect classifiers in the same time,
the latter is guaranteed to improve.
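
The NKP iteration only needs the matrix $G$. Below is a minimal sketch of Algorithm 3, assuming the step size $\theta_k = 1/(k+1)$ and breaking ties in $p(\alpha)$ by picking a single vertex of the simplex; the function name `nkp` and the toy matrix `V` (rows standing in for the vectors $y_i \tilde\phi_{x_i}$ under a linear kernel) are illustrative assumptions.

```python
import numpy as np

def nkp(G, max_iter=10000):
    """Normalized Kernel Perceptron on a signed normalized Gram matrix G."""
    n = G.shape[0]
    alpha = np.zeros(n)
    for k in range(max_iter):
        if np.all(G @ alpha > 0):
            return alpha, k                 # alpha encodes a perfect classifier
        theta = 1.0 / (k + 1)
        p = np.zeros(n)
        p[np.argmin(G @ alpha)] = 1.0       # a vertex minimizing <alpha, p>_G over the simplex
        alpha = (1 - theta) * alpha + theta * p
    return None, max_iter

# Hypothetical unit vectors playing the role of y_i * normalized features.
V = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-0.6, 0.8]])
G = V @ V.T
alpha, k = nkp(G)
```

Since each iterate averages simplex vertices, the returned `alpha` is a probability vector, matching the remark that follows.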

Remark. $\alpha_{k+1}$ is always a probability distribution. Curiously, a guarantee that the solution will lie in $\Delta_n$
is not made by the Representer Theorem in Eq. (8) - any $\alpha \in \mathbb{R}^n$ could satisfy Lemma 1. However, since
NKP is a subgradient method for minimizing Eq. (6), we know that we will approach the optimum while only
choosing $\alpha \in \Delta_n$.

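Define the smooth minimizer analogous to Eq. (10) as

$$p_\mu(\alpha) := \arg\min_{p \in \Delta_n} \left\{ \langle \alpha, p \rangle_G + \mu d(p) \right\} = \frac{e^{-G\alpha/\mu}}{\| e^{-G\alpha/\mu} \|_1}, \qquad (11)$$

$$\text{where } d(p) := \sum_i p_i \log p_i + \log n \qquad (12)$$

is 1-strongly convex with respect to the $\ell_1$-norm (Nesterov, 2005). Define a smoothened loss function as in
Eq. (9)

$$L_\mu(\alpha) = \sup_{p \in \Delta_n} \left\{ -\langle \alpha, p \rangle_G - \mu d(p) \right\} + \tfrac{1}{2} \|\alpha\|_G^2.$$

Note that the maximizer above is precisely $p_\mu(\alpha)$.

Algorithm 4 Smoothed Normalized Kernel Perceptron
    Set $\alpha_0 = 1_n/n$, $\mu_0 := 2$, $p_0 := p_{\mu_0}(\alpha_0)$
    for $k = 0, 1, 2, 3, \dots$ do
        if $G\alpha_k > 0_n$ then
            Halt: $\alpha_k$ is solution to Eq. (8)
        else
            $\theta_k := \frac{2}{k+3}$
            $\alpha_{k+1} := (1 - \theta_k)(\alpha_k + \theta_k p_k) + \theta_k^2 p_{\mu_k}(\alpha_k)$
            $\mu_{k+1} := (1 - \theta_k) \mu_k$
            $p_{k+1} := (1 - \theta_k) p_k + \theta_k p_{\mu_{k+1}}(\alpha_{k+1})$
        end if
    end for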

Lemma 4 (Lower Bound). At any step $k$, we have

$$L_{\mu_k}(\alpha_k) \ge L(\alpha_k) - \mu_k \log n.$$

Proof. First note that $\sup_{p \in \Delta_n} d(p) = \log n$. Also,

$$\sup_{p \in \Delta_n} \left\{ -\langle \alpha, p \rangle_G - \mu d(p) \right\} \ge \sup_{p \in \Delta_n} \left\{ -\langle \alpha, p \rangle_G \right\} - \sup_{p \in \Delta_n} \left\{ \mu d(p) \right\}.$$

Combining these two facts gives us the result. □


Lemma 5 (Upper Bound). In any round $k$, SNKP satisfies

$$L_{\mu_k}(\alpha_k) \le -\tfrac{1}{2} \|p_k\|_G^2.$$

Proof. We provide a concise, self-contained and unified proof by induction in the Appendix for Lemma 5
and Lemma 8, borrowing ideas from Nesterov's excessive gap technique (Nesterov, 2005) for smooth
minimization of structured non-smooth functions. □


Finally, we combine the above lemmas to get the following theorem about the performance of SNKP. 
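Theorem 1. The SNKP algorithm finds a perfect classifier $f \in \mathcal{F}_K$ when one exists in $O\left( \frac{\sqrt{\log n}}{\rho_K} \right)$ iterations.

Proof. Lemma 4 gives us for any round $k$,

$$L_{\mu_k}(\alpha_k) \ge L(\alpha_k) - \mu_k \log n.$$

From Lemmas 3 and 5 we get

$$L_{\mu_k}(\alpha_k) \le -\tfrac{1}{2} \|p_k\|_G^2 \le -\tfrac{1}{2} \rho_K^2.$$

Combining the two equations, we get that

$$L(\alpha_k) \le \mu_k \log n - \tfrac{1}{2} \rho_K^2.$$

Noting that $\mu_k = \frac{4}{(k+1)(k+2)} < \frac{4}{(k+1)^2}$, we see that $L(\alpha_k) < 0$ (and hence we solve the problem by
Lemma 1) after at most $k = 2\sqrt{2} \sqrt{\log n} / \rho_K$ steps. □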
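
The smoothed iteration can be sketched in the same style. This is an illustrative implementation of Algorithm 4 under assumptions: the names `softmin` and `snkp` are hypothetical, $\theta_k = 2/(k+3)$ is the step size consistent with $\mu_k = 4/((k+1)(k+2))$, and the toy matrix `V` again stands in for the vectors $y_i \tilde\phi_{x_i}$.

```python
import numpy as np

def softmin(G, alpha, mu):
    """Entropy-smoothed minimizer p_mu(alpha), proportional to exp(-G alpha / mu)."""
    z = -(G @ alpha) / mu
    z = z - z.max()                  # shift for numerical stability; ratios are unchanged
    w = np.exp(z)
    return w / w.sum()

def snkp(G, max_iter=10000):
    """Smoothed Normalized Kernel Perceptron on a signed normalized Gram matrix G."""
    n = G.shape[0]
    alpha = np.full(n, 1.0 / n)
    mu = 2.0
    p = softmin(G, alpha, mu)
    for k in range(max_iter):
        if np.all(G @ alpha > 0):
            return alpha, k
        theta = 2.0 / (k + 3)
        # The right-hand side uses the old alpha and the old mu, as in Algorithm 4.
        alpha = (1 - theta) * (alpha + theta * p) + theta ** 2 * softmin(G, alpha, mu)
        mu = (1 - theta) * mu
        p = (1 - theta) * p + theta * softmin(G, alpha, mu)
    return None, max_iter

# Hypothetical unit vectors; their margin is about 0.447, attained at the bisector.
V = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-0.6, 0.8]])
G = V @ V.T
alpha, k = snkp(G)
```

For this toy input the bound $2\sqrt{2}\sqrt{\log n}/\rho_K$ evaluates to roughly 7.5, so the loop should stop within a handful of iterations.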


5 Infeasible Problems 


What happens when the points are not separable by any function $f \in \mathcal{F}_K$? We would like an algorithm that
terminates with a solution when there is one, and terminates with a certificate of non-separability if there isn't
one. The idea is based on theorems of the alternative like Farkas' Lemma, specifically a version of Gordan's
theorem (Chvátal, 1983):

Lemma 6 (Gordan's Thm). Exactly one of the following two statements can be true
1. Either there exists a $w \in \mathbb{R}^d$ such that for all $i$,

$$y_i (w^\top x_i) > 0,$$

2. Or, there exists a $p \in \Delta_n$ such that

$$\|XYp\|_2 = 0, \qquad (13)$$

or equivalently $\sum_i p_i y_i x_i = 0$.
(13) 

As mentioned in the introduction, the primal problem can be interpreted as finding a vector in the interior
of the dual cone of $\mathrm{cone}\{y_i x_i\}$, which is infeasible if the dual cone is flat, i.e. if $\mathrm{cone}\{y_i x_i\}$ is not pointed,
which happens when the origin is in the convex combination of the $y_i x_i$s.

We will generalize the following algorithm for linear feasibility problems, that can be dated back to
Von-Neumann, who mentioned it in a private communication with Dantzig, who later studied it himself (Dantzig,
1992).

Algorithm 5 Normalized Von-Neumann (NVN)
    Initialize $p_0 = 1_n/n$, $w_0 = XYp_0$
    for $k = 0, 1, 2, 3, \dots$ do
        if $\|XYp_k\|_2 \le \epsilon$ then
            Exit and return $p_k$ as an $\epsilon$-solution to (13)
        else
            $j := \arg\min_i y_i x_i^\top w_k$
            $\theta_k := \arg\min_{\lambda \in [0,1]} \|(1 - \lambda) w_k + \lambda y_j x_j\|_2$
            $p_{k+1} := (1 - \theta_k) p_k + \theta_k e_j$
            $w_{k+1} := XYp_{k+1} = (1 - \theta_k) w_k + \theta_k y_j x_j$
        end if
    end for

This algorithm comes with a guarantee: If the problem (3) is infeasible, then the above algorithm will
terminate with an $\epsilon$-approximate solution to (13) in $1/\epsilon^2$ iterations.

Epelman & Freund (2000) proved an incomparable bound - Normalized Von-Neumann (NVN) can compute
an $\epsilon$-solution to (13) in $O\left( \frac{1}{\rho_2^2} \log\left( \frac{1}{\epsilon} \right) \right)$ and can also find a solution to the primal (using $w_k$) in $O\left( \frac{1}{\rho_2^2} \right)$
when it is feasible.
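
A compact sketch of the NVN iteration follows. It is illustrative only: the function name `nvn`, the row-wise data layout and the toy data are assumptions, and the line search over $\lambda \in [0,1]$ is solved with the closed form for minimizing a quadratic on a segment.

```python
import numpy as np

def nvn(X, y, eps=1e-2, max_iter=100000):
    """Normalized Von-Neumann. X has shape (n, d), rows are points."""
    A = y[:, None] * X                      # rows are y_i x_i
    n = A.shape[0]
    p = np.full(n, 1.0 / n)
    w = A.T @ p                             # w = X Y p in the paper's notation
    for k in range(max_iter):
        if np.linalg.norm(w) <= eps:
            return p, w, k                  # p is an eps-solution to (13)
        j = np.argmin(A @ w)
        v = A[j]
        diff = w - v
        denom = diff @ diff
        # Exact minimizer of ||(1 - lam) w + lam v||_2 over lam in [0, 1].
        lam = 0.0 if denom == 0 else float(np.clip((w @ diff) / denom, 0.0, 1.0))
        p = (1 - lam) * p
        p[j] += lam
        w = (1 - lam) * w + lam * v
    return p, w, max_iter

# Hypothetical non-separable data: the + point is the midpoint of the two - points.
X = np.array([[1.0, 0.0], [1.0, -0.5], [1.0, 0.5]])
y = np.array([1.0, -1.0, -1.0])
p, w, k = nvn(X, y)
```

When the data are separable the norm of `w` stays bounded away from zero, so this sketch simply runs until `max_iter`; recovering a primal solution from $w_k$ is not implemented here.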

We derive a smoothed variant of NVN in the next section, after we prove some crucial lemmas in RKHSs. 


5.1 A Separation Theorem for RKHSs 

While finite dimensional Euclidean spaces come with strong separation guarantees that come under various 
names like the separating hyperplane theorem, Gordan’s theorem, Farkas’ lemma, etc, the story isn’t always 
the same for infinite dimensional function spaces which can often be tricky to deal with. We will prove an 
appropriate version of such a theorem that will be useful in our setting. 

What follows is an interesting version of the Hahn-Banach separation theorem, which looks a lot like
Gordan's theorem in finite dimensional spaces. The conditions to note here are that either $G\alpha > 0$ or
$\|p\|_G = 0$.


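Theorem 2. Exactly one of the following has a solution:
1. Either $\exists f \in \mathcal{F}_K$ such that for all $i$,

$$\frac{y_i f(x_i)}{\sqrt{K(x_i, x_i)}} = \langle f, y_i \tilde\phi_{x_i} \rangle_K > 0 \quad \text{i.e.} \quad G\alpha > 0,$$

2. Or $\exists p \in \Delta_n$ such that

$$\sum_i p_i y_i \tilde\phi_{x_i} = 0 \in \mathcal{F}_K \quad \text{i.e.} \quad \|p\|_G = 0. \qquad (14)$$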

Proof. Consider the following set

$$Q = \left\{ (f, t) = \Big( \sum_i p_i y_i \tilde\phi_{x_i}, \sum_i p_i \Big) : p \in \Delta_n \right\} = \mathrm{conv}\left[ (y_1 \tilde\phi_{x_1}, 1), \dots, (y_n \tilde\phi_{x_n}, 1) \right] \subseteq \mathcal{F}_K \times \mathbb{R}.$$

If (2) does not hold, then it implies that $(0, 1) \notin Q$. Since $Q$ is closed and convex, we can find a separating
hyperplane between $Q$ and $(0, 1)$, or in other words there exists $(f, t) \in \mathcal{F}_K \times \mathbb{R}$ such that

$$\langle (f, t), (g, s) \rangle \ge 0 \quad \forall (g, s) \in Q$$
$$\text{and} \quad \langle (f, t), (0, 1) \rangle < 0.$$

The second condition immediately yields $t < 0$. The first condition, when applied to $(g, s) = (y_i \tilde\phi_{x_i}, 1) \in Q$,
yields

$$\langle f, y_i \tilde\phi_{x_i} \rangle_K + t \ge 0 \quad \Rightarrow \quad \frac{y_i f(x_i)}{\sqrt{K(x_i, x_i)}} > 0$$

since $t < 0$, which shows that (1) holds.

It is also immediate that if (2) holds, then (1) cannot. □

Note that $G$ is positive semi-definite - infeasibility requires both that it is not positive definite, and also
that the witness to $Gp = 0$ must be a probability vector. Similarly, while it suffices that $G\alpha > 0$ for some
$\alpha \in \mathbb{R}^n$, coincidentally in our case $\alpha$ will also lie in the probability simplex.


5.2 The infeasible margin 

Note that constraining $\|f\|_K = 1$ (or $\|\alpha\|_G = 1$) in Eq. (5) and Lemma 3 allows $\rho_K$ to be negative in the
infeasible case. If it was $\le$, then $\rho_K$ would have been non-negative because $f = 0$ (i.e. $\alpha = 0$) is always
allowed.

So what is $\rho_K$ when the problem is infeasible? Let

$$\mathrm{conv}(Y \tilde\phi_X) := \left\{ \sum_i p_i y_i \tilde\phi_{x_i} \,\Big|\, p \in \Delta_n \right\} \subset \mathcal{F}_K$$

be the convex hull of the $y_i \tilde\phi_{x_i}$s.

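Theorem 3. When the primal is infeasible, the margin* is

$$|\rho_K| = \delta_{\max} := \sup \left\{ \delta \,\Big|\, \|f\|_K \le \delta \Rightarrow f \in \mathrm{conv}(Y \tilde\phi_X) \right\}.$$

*We thank a reviewer for pointing out that by this definition, $\rho_K$ might always be 0 for infinite dimensional RKHSs because there
are always directions perpendicular to the finite-dimensional hull - we conjecture the definition can be altered to restrict attention to the
relative interior of the hull, making it non-zero.

Proof. (1) For inequality $\ge$. Choose any $\delta$ such that $f \in \mathrm{conv}(Y \tilde\phi_X)$ for any $\|f\|_K \le \delta$. Given an arbitrary
$f' \in \mathcal{F}_K$ with $\|f'\|_K = 1$, put $f := -\delta f'$.

By our assumption on $\delta$, we have $f \in \mathrm{conv}(Y \tilde\phi_X)$, implying there exists a $\tilde p \in \Delta_n$ such that $f = \langle Y \tilde\phi_X, \tilde p \rangle$. Also

$$\langle f', \langle Y \tilde\phi_X, \tilde p \rangle \rangle_K = \langle f', f \rangle_K = -\delta \|f'\|_K^2 = -\delta.$$

Since this holds for a particular $\tilde p$, we can infer

$$\inf_{p \in \Delta_n} \langle f', \langle Y \tilde\phi_X, p \rangle \rangle_K \le -\delta.$$

Since this holds for any $f'$ with $\|f'\|_K = 1$, we have

$$\sup_{\|f\|_K = 1} \inf_{p \in \Delta_n} \langle f, \langle Y \tilde\phi_X, p \rangle \rangle_K \le -\delta \quad \text{i.e.} \quad |\rho_K| \ge \delta.$$

(2) For inequality $\le$. It suffices to show $\|f\|_K \le |\rho_K| \Rightarrow f \in \mathrm{conv}(Y \tilde\phi_X)$. We will prove the contrapositive
$f \notin \mathrm{conv}(Y \tilde\phi_X) \Rightarrow \|f\|_K > |\rho_K|$.

Since $\Delta_n$ is compact and convex, $\mathrm{conv}(Y \tilde\phi_X) \subset \mathcal{F}_K$ is closed and convex. Therefore if $f \notin \mathrm{conv}(Y \tilde\phi_X)$,
then there exists $g \in \mathcal{F}_K$ with $\|g\|_K = 1$ that strictly separates $f$ and $\mathrm{conv}(Y \tilde\phi_X)$, i.e.

$$\langle g, f \rangle_K < \inf_{p \in \Delta_n} \langle g, \langle Y \tilde\phi_X, p \rangle \rangle_K \le \sup_{\|f\|_K = 1} \inf_{p \in \Delta_n} \langle f, \langle Y \tilde\phi_X, p \rangle \rangle_K = \rho_K.$$

Since $\rho_K < 0$,

$$|\rho_K| < |\langle f, g \rangle_K| \le \|f\|_K \|g\|_K = \|f\|_K.$$

□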

6 Kernelized Primal-Dual Algorithms 

The preceding theorems allow us to write a variant of the Normalized VonNeumann algorithm from the 
previous section that is smoothed and works for RKHSs. Define 

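$$W := \left\{ p \in \Delta_n \,\Big|\, \sum_i p_i y_i \tilde\phi_{x_i} = 0 \right\} = \left\{ p \in \Delta_n \,\big|\, \|p\|_G = 0 \right\}$$

as the set of witnesses to the infeasibility of the primal. The following lemma bounds the distance of any
point in the simplex from the witness set by its $\|\cdot\|_G$ norm.

Lemma 7. For all $q \in \Delta_n$, the distance to the witness set

$$\mathrm{dist}(q, W) := \min_{w \in W} \|q - w\|_2 \le \min \left\{ \sqrt{2}, \frac{\sqrt{2} \|q\|_G}{|\rho_K|} \right\}.$$

As a consequence, $\|p\|_G = 0$ iff $p \in W$.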

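Proof. This is trivial for $p \in W$. For arbitrary $p \in \Delta_n \setminus W$, let $\tilde p := -\frac{|\rho_K| p}{\|p\|_G}$ so that $\|\langle Y \tilde\phi_X, \tilde p \rangle\|_K = \|\tilde p\|_G \le |\rho_K|$.

Hence by Theorem 3 there exists $\alpha \in \Delta_n$ such that

$$\langle Y \tilde\phi_X, \alpha \rangle = \langle Y \tilde\phi_X, \tilde p \rangle.$$

Let $\beta = \lambda \alpha + (1 - \lambda) p$ where $\lambda = \frac{\|p\|_G}{\|p\|_G + |\rho_K|}$. Then

$$\langle Y \tilde\phi_X, \beta \rangle = \frac{1}{\|p\|_G + |\rho_K|} \left\langle Y \tilde\phi_X, \|p\|_G \alpha + |\rho_K| p \right\rangle = 0,$$

so $\beta \in W$ (by definition of what it means to be in $W$) and

$$\|p - \beta\|_2 = \lambda \|p - \alpha\|_2 \le \lambda \sqrt{2} \le \min \left\{ \sqrt{2}, \frac{\sqrt{2} \|p\|_G}{|\rho_K|} \right\}.$$

We take min with $\sqrt{2}$ because $\rho_K$ might be 0. □

Hence for the primal or dual problem, points with small $G$-norm are revealing - either Lemma 3 shows
that the margin $\rho_K \le \|p\|_G$ will be small, or if it is infeasible then the above lemma shows that it is close to
the witness set.

We need a small alteration to the smoothing entropy prox-function that we used earlier. We will now use

$$d_q(p) = \tfrac{1}{2} \|p - q\|_2^2$$

for some given $q \in \Delta_n$, which is strongly convex with respect to the $\ell_2$ norm. This allows us to define

$$p_\mu^q(\alpha) = \arg\min_{p \in \Delta_n} \langle G\alpha, p \rangle + \frac{\mu}{2} \|p - q\|_2^2,$$

$$L_\mu^q(\alpha) = \sup_{p \in \Delta_n} \left\{ -\langle \alpha, p \rangle_G - \mu d_q(p) \right\} + \tfrac{1}{2} \|\alpha\|_G^2,$$

where $p_\mu^q(\alpha)$ can easily be found by sorting the entries of $q - \frac{G\alpha}{\mu}$.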
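
Completing the square shows that $p_\mu^q(\alpha)$ is the Euclidean projection of $q - G\alpha/\mu$ onto the simplex, which is what the sorting step computes. The sketch below uses the standard sort-and-threshold projection; the names `project_simplex` and `p_mu_q` are hypothetical, and this particular projection routine is an assumption about how the sorting step would be realized.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex via sorting."""
    u = np.sort(v)[::-1]                          # entries in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    last = np.nonzero(u - css / idx > 0)[0][-1]   # last index kept in the support
    tau = css[last] / (last + 1)                  # shift that makes the result sum to 1
    return np.maximum(v - tau, 0.0)

def p_mu_q(G, alpha, q, mu):
    """Minimizer of <G alpha, p> + (mu / 2) ||p - q||_2^2 over the simplex."""
    return project_simplex(q - (G @ alpha) / mu)

a = project_simplex(np.array([2.0, 0.0]))
b = project_simplex(np.array([0.5, 0.5, 0.5]))
c = project_simplex(np.array([0.2, 0.3, 0.5]))
```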


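Algorithm 6 Smoothed Normalized Kernel Perceptron-VonNeumann ($SNKPVN(q, \delta)$)
    Input $q \in \Delta_n$, accuracy $\delta > 0$
    Set $\alpha_0 = q$, $\mu_0 := 2n$, $p_0 := p^q_{\mu_0}(\alpha_0)$
    for $k = 0, 1, 2, 3, \dots$ do
        if $G\alpha_k > 0_n$ then
            Halt: $\alpha_k$ is solution to Eq. (8)
        else if $\|p_k\|_G < \delta$ then
            Return $p_k$
        else
            $\theta_k := \frac{2}{k+3}$
            $\alpha_{k+1} := (1 - \theta_k)(\alpha_k + \theta_k p_k) + \theta_k^2 p^q_{\mu_k}(\alpha_k)$
            $\mu_{k+1} := (1 - \theta_k) \mu_k$
            $p_{k+1} := (1 - \theta_k) p_k + \theta_k p^q_{\mu_{k+1}}(\alpha_{k+1})$
        end if
    end for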


When the primal is feasible, SNKPVN is similar to SNKP.

Lemma 8 (When $\rho_K > 0$ and $\delta < \rho_K$). For any $q \in \Delta_n$,

$$-\tfrac{1}{2} \|p_k\|_G^2 \ge L^q_{\mu_k}(\alpha_k) \ge L(\alpha_k) - \mu_k.$$

Hence SNKPVN finds a separator $f$ in $O\left( \frac{\sqrt{n}}{\rho_K} \right)$ iterations.

Proof. We give a unified proof for the first inequality and Lemma 5 in the Appendix. The second inequality
mimics Lemma 4. The final statement mimics Theorem 1. □


The following lemma captures the near-infeasible case. 

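Lemma 9 (When $\rho_K < 0$ or $\delta > \rho_K$). For any $q \in \Delta_n$,

$$-\tfrac{1}{2} \|p_k\|_G^2 \ge L^q_{\mu_k}(\alpha_k) \ge -\tfrac{1}{2} \mu_k \, \mathrm{dist}(q, W)^2.$$

Hence SNKPVN finds a $\delta$-solution in at most $O\left( \min \left\{ \frac{\sqrt{n}}{\delta}, \frac{\sqrt{n} \|q\|_G}{\delta |\rho_K|} \right\} \right)$ iterations.

Proof. The first inequality is the same as in the above Lemma 8, and is proved in the Appendix. For the second,

$$L^q_{\mu_k}(\alpha) = \sup_{p \in \Delta_n} \left\{ -\langle \alpha, p \rangle_G - \mu_k d_q(p) \right\} + \tfrac{1}{2} \|\alpha\|_G^2$$
$$\ge \sup_{p \in W} \left\{ -\langle \alpha, p \rangle_G - \mu_k d_q(p) \right\}$$
$$= \sup_{p \in W} \left\{ -\tfrac{1}{2} \mu_k \|p - q\|_2^2 \right\}$$
$$= -\tfrac{1}{2} \mu_k \, \mathrm{dist}(q, W)^2$$
$$\ge -\tfrac{1}{2} \mu_k \min \left\{ 2, \frac{2 \|q\|_G^2}{|\rho_K|^2} \right\} \quad \text{using Lemma 7.}$$

Since $\mu_k = \frac{4n}{(k+1)(k+2)} \le \frac{4n}{(k+1)^2}$, we get

$$\|p_k\|_G \le \frac{2\sqrt{n}}{k+1} \min \left\{ \sqrt{2}, \frac{\sqrt{2} \|q\|_G}{|\rho_K|} \right\}.$$

Hence $\|p_k\|_G \le \delta$ after $2\sqrt{2n} \min \left\{ \frac{1}{\delta}, \frac{\|q\|_G}{\delta |\rho_K|} \right\}$ steps. □

Using SNKPVN as a subroutine gives our final algorithm.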


Algorithm 7 Iterated Smoothed Normalized Kernel Perceptron-VonNeumann ($ISNKPVN(\gamma, \epsilon)$)
    Input constant $\gamma > 1$, accuracy $\epsilon > 0$
    Set $q_0 := 1_n/n$
    for $t = 0, 1, 2, 3, \dots$ do
        $\delta_t := \|q_t\|_G / \gamma$
        $q_{t+1} := SNKPVN(q_t, \delta_t)$
        if $\delta_t < \epsilon$ then
            Halt; $q_{t+1}$ is a solution to Eq. (14)
        end if
    end for

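Theorem 4. Algorithm ISNKPVN satisfies

1. If the primal (2) is feasible and $\epsilon < \rho_K$, then each call to SNKPVN halts in at most $O\left( \frac{\sqrt{n}}{\rho_K} \right)$ iterations.
Algorithm ISNKPVN finds a solution in at most $\frac{\log(1/\rho_K)}{\log(\gamma)}$ outer loops, bounding the total iterations by

$$O\left( \frac{\sqrt{n}}{\rho_K} \log\left( \frac{1}{\rho_K} \right) \right).$$

2. If the dual (14) is feasible or $\epsilon > \rho_K$, then each call to SNKPVN halts in at most $O\left( \min \left\{ \frac{\sqrt{n}}{\epsilon}, \frac{\sqrt{n}}{|\rho_K|} \right\} \right)$
steps. Algorithm ISNKPVN finds an $\epsilon$-solution in at most $\frac{\log(1/\epsilon)}{\log(\gamma)}$ outer loops, bounding the total
iterations by

$$O\left( \min \left\{ \frac{\sqrt{n}}{\epsilon}, \frac{\sqrt{n}}{|\rho_K|} \right\} \log\left( \frac{1}{\epsilon} \right) \right).$$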

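Proof. First note that if ISNKPVN has not halted, then we know that after $t$ outer iterations, $q_{t+1}$ has small
$G$-norm:

$$\|q_{t+1}\|_G \le \delta_t \le \frac{\|q_0\|_G}{\gamma^{t+1}} \le \frac{1}{\gamma^{t+1}}. \qquad (15)$$

The first inequality holds because of the inner loop return condition, the second because of the update for $\delta_t$, and the
third because $\|q_0\|_G \le \|q_0\|_1 = 1$ by Lemma 2.

1. Lemma 3 shows that for all $p$ we have $\rho_K \le \|p\|_G$, so the inner loop will halt with a solution to the
primal as soon as $\delta_t \le \rho_K$ (so that $\|p\|_G < \delta_t \le \rho_K$ cannot be satisfied for the inner loop to return).
From Eq. (15), this will definitely happen when $\frac{1}{\gamma^{T+1}} \le \rho_K$, i.e. within $T = \frac{\log(1/\rho_K)}{\log(\gamma)}$ iterations.
By Lemma 8, each iteration runs for at most $O\left( \frac{\sqrt{n}}{\rho_K} \right)$ steps.

2. We halt with an $\epsilon$-solution when $\delta_t < \epsilon$, which definitely happens when $\frac{1}{\gamma^{T+1}} < \epsilon$, i.e. within
$T = \frac{\log(1/\epsilon)}{\log(\gamma)}$ iterations. Since $\frac{\|q_t\|_G}{\delta_t} = \gamma$, by Lemma 9 each iteration runs for at most
$O\left( \min \left\{ \frac{\sqrt{n}}{\epsilon}, \frac{\sqrt{n}}{|\rho_K|} \right\} \right)$ steps. □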

7 Discussion 

The SNK-Perceptron algorithm presented in this paper has a convergence rate of $\frac{\sqrt{\log n}}{\rho_K}$ and the Iterated
SNK-Perceptron-Von-Neumann algorithm has a $\min \left\{ \frac{\sqrt{n}}{\epsilon}, \frac{\sqrt{n}}{|\rho_K|} \right\}$ dependence on the number of points. Note
that both of these are independent of the underlying dimensionality of the problem. We conjecture that it is
possible to reduce this dependence to $\sqrt{\log n}$ for the primal-dual algorithm also, without paying a price in
terms of the dependence on margin $1/\rho$ (or the dependence on $\epsilon$).

A tighter dependence on $n$ may be possible if we try other smoothing functions instead of the
$\ell_2$ norm used in the last section. Specifically, it might be tempting to smooth with the $\|\cdot\|_G$ semi-norm and
define

$$p_\mu^q(\alpha) = \arg\min_{p \in \Delta_n} \langle \alpha, p \rangle_G + \frac{\mu}{2} \|p - q\|_G^2.$$

One can actually see that the proofs in the Appendix go through with no dimension dependence on $n$ at all!
However, it is not possible to solve this in closed form - taking $\alpha = q$ and $\mu = 1$ reduces the problem to
asking

$$p^q(q) = \arg\min_{p \in \Delta_n} \tfrac{1}{2} \|p\|_G^2$$

which is an oracle for our problem as seen by equation (14) - the solution's $G$-norm is 0 iff the problem is
infeasible.

In the bigger picture, there are several interesting open questions. The ellipsoid algorithm for solving 
linear feasibility problems has a logarithmic dependence on 1 /e, and a polynomial dependence on dimension. 
Recent algorithms involving repeated rescaling of the space like the one by Dunagan & Vempala (2008)
have logarithmic dependence on $1/\rho$ and polynomial in dimension. While both these algorithms are polytime
under the real number model of computation of Blum et al. (1998), it is unknown whether there is
any algorithm that can achieve a polylogarithmic dependence on the margin/accuracy, and a polylogarithmic
dependence on dimension. This is strongly related to the open question of whether it is possible to learn a
decision list polynomially in its binary description length.

One can nevertheless ask whether rescaled smoothed perceptron methods like Dunagan & Vempala (2008)
can be lifted to RKHSs, and whether using an iterated smoothed kernel perceptron would yield faster rates.
The recent work Soheili & Peña (2013a) is a challenge to generalize - the proofs relying on geometry involve
arguing about volumes of balls of functions in an RKHS - we conjecture that it is possible to do, but we leave
it for a later work.
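
A Unified Proof By Induction of Lemma 5, 8: $L_{\mu_k}(\alpha_k) \le -\tfrac{1}{2} \|p_k\|_G^2$

Let $d(p)$ be 1-strongly convex with respect to the $\#$-norm, i.e. $d(q) - d(p) - \langle \nabla d(p), q - p \rangle \ge \tfrac{1}{2} \|q - p\|_\#^2$ for
any $p, q \in \Delta_n$. Let the $\#$-norm be lower bounded by the $G$-norm as $\|p\|_G^2 \le \lambda_\# \|p\|_\#^2$. For $d(p) =
\sum_i p_i \log p_i + \log n$, $\#$ is the 1-norm, $\lambda_\# = 1$ and $p^* = 1_n/n$. For $d(p) = \tfrac{1}{2} \|q - p\|_2^2$, $\#$ is the 2-norm,
$\lambda_\# = n$ and $p^* = q$. Choose $\mu_0 = 2\lambda_\#$.

Let the smoothed minimizer be defined by $p_\mu(\alpha) := \arg\min_{p \in \Delta_n} \langle G\alpha, p \rangle + \mu d(p)$, and $p^* := \arg\min_{p \in \Delta_n} d(p)$.
The optimality condition of $p_\mu(\alpha)$ and $p^*$ (the gradient is perpendicular to any feasible direction) is that for
any $r \in \Delta_n$,

$$\langle G\alpha + \mu \nabla d(p_\mu(\alpha)), r - p_\mu(\alpha) \rangle = 0 \qquad (16)$$

$$\langle \nabla d(p^*), r - p^* \rangle = 0 \;\Rightarrow\; d(p_0) \ge \tfrac{1}{2} \|p_0 - p^*\|_\#^2 \qquad (17)$$

For $k = 0$, recalling that $\alpha_0 = p^*$ and $p_0 = p_{\mu_0}(\alpha_0)$:

$$-\tfrac{1}{2} \|p_0\|_G^2 = -\tfrac{1}{2} \|p_0 - p^*\|_G^2 - \langle p^*, p_0 - p^* \rangle_G - \tfrac{1}{2} \|p^*\|_G^2 \quad \text{writing } p_0 = (p_0 - p^*) + p^*$$
$$\ge -\tfrac{\lambda_\#}{2} \|p_0 - p^*\|_\#^2 - \langle p^*, p_0 \rangle_G + \tfrac{1}{2} \|p^*\|_G^2 \quad \text{using } \|p\|_G^2 \le \lambda_\# \|p\|_\#^2$$
$$\ge -\mu_0 d(p_0) - \langle \alpha_0, p_0 \rangle_G + \tfrac{1}{2} \|\alpha_0\|_G^2 \quad \text{adding } -\tfrac{\lambda_\#}{2} \|p_0 - p^*\|_\#^2 \text{, using Eq. (17)}$$
$$= L_{\mu_0}(\alpha_0).$$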
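
Assume it holds up to $k$. We drop index $k$, and write $x_+$ for $x_{k+1}$. Let $\hat p = (1 - \theta) p + \theta p_\mu(\alpha)$ so that

$$\alpha_+ = (1 - \theta) \alpha + \theta \hat p. \qquad (18)$$

Then, since $\mu_+ = (1 - \theta) \mu$,

$$L_{\mu_+}(\alpha_+) = \tfrac{1}{2} \|\alpha_+\|_G^2 - \langle \alpha_+, p_{\mu_+}(\alpha_+) \rangle_G - \mu_+ d(p_{\mu_+}(\alpha_+))$$
$$= \tfrac{1}{2} \|(1 - \theta) \alpha + \theta \hat p\|_G^2 - \theta \langle \hat p, p_{\mu_+}(\alpha_+) \rangle_G - (1 - \theta) \left[ \langle \alpha, p_{\mu_+}(\alpha_+) \rangle_G + \mu d(p_{\mu_+}(\alpha_+)) \right] \quad \text{using Eq. (18)}$$
$$\le (1 - \theta) \left[ \tfrac{1}{2} \|\alpha\|_G^2 - \langle \alpha, p_{\mu_+}(\alpha_+) \rangle_G - \mu d(p_{\mu_+}(\alpha_+)) \right]_1 + \theta \left[ -\tfrac{1}{2} \|\hat p\|_G^2 - \langle \hat p, p_{\mu_+}(\alpha_+) - \hat p \rangle_G \right]_2,$$

where we used the convexity of $\|\cdot\|_G^2$. Recall $p_+ = (1 - \theta) p + \theta p_{\mu_+}(\alpha_+)$, so that

$$p_+ - \hat p = \theta \left( p_{\mu_+}(\alpha_+) - p_\mu(\alpha) \right). \qquad (19)$$

The first bracket can be bounded as

$$[\,\cdot\,]_1 = \tfrac{1}{2} \|\alpha\|_G^2 - \langle \alpha, p_\mu(\alpha) \rangle_G - \mu d(p_\mu(\alpha)) - \langle \alpha, p_{\mu_+}(\alpha_+) - p_\mu(\alpha) \rangle_G - \mu \left[ d(p_{\mu_+}(\alpha_+)) - d(p_\mu(\alpha)) \right]$$
$$= L_\mu(\alpha) - \mu \left[ d(p_{\mu_+}(\alpha_+)) - d(p_\mu(\alpha)) - \langle \nabla d(p_\mu(\alpha)), p_{\mu_+}(\alpha_+) - p_\mu(\alpha) \rangle \right] \quad \text{using Eq. (16)}$$
$$\le -\tfrac{1}{2} \|p\|_G^2 - \tfrac{\mu}{2} \|p_{\mu_+}(\alpha_+) - p_\mu(\alpha)\|_\#^2 \quad \text{using the induction hypothesis and strong convexity of } d(p)$$
$$\le -\tfrac{1}{2} \|\hat p + (p - \hat p)\|_G^2 - \tfrac{\mu}{2\lambda_\#} \|p_{\mu_+}(\alpha_+) - p_\mu(\alpha)\|_G^2 \quad \text{using } \|p\|_G^2 \le \lambda_\# \|p\|_\#^2$$
$$\le -\tfrac{1}{2} \|\hat p\|_G^2 - \langle \hat p, p - \hat p \rangle_G - \tfrac{\mu}{2\lambda_\# \theta^2} \|p_+ - \hat p\|_G^2 \quad \text{using Eq. (19) and dropping a } -\tfrac{1}{2} \|p - \hat p\|_G^2 \text{ term.}$$

Using $(1 - \theta)(p - \hat p) = -\theta (p_\mu(\alpha) - \hat p)$ and substituting back,

$$L_{\mu_+}(\alpha_+) \le (1 - \theta) \left[ -\tfrac{1}{2} \|\hat p\|_G^2 + \tfrac{\theta}{1 - \theta} \langle \hat p, p_\mu(\alpha) - \hat p \rangle_G - \tfrac{\mu}{2\lambda_\# \theta^2} \|p_+ - \hat p\|_G^2 \right] + \theta \left[ -\tfrac{1}{2} \|\hat p\|_G^2 - \langle \hat p, p_{\mu_+}(\alpha_+) - \hat p \rangle_G \right]$$
$$= -\tfrac{1}{2} \|\hat p\|_G^2 - \theta \langle \hat p, p_{\mu_+}(\alpha_+) - p_\mu(\alpha) \rangle_G - \tfrac{(1 - \theta) \mu}{2\lambda_\# \theta^2} \|p_+ - \hat p\|_G^2$$
$$\le -\tfrac{1}{2} \|\hat p\|_G^2 - \langle \hat p, p_+ - \hat p \rangle_G - \tfrac{1}{2} \|p_+ - \hat p\|_G^2 \quad \text{using Eq. (19) and } \tfrac{(1 - \theta_k) \mu_k}{\lambda_\# \theta_k^2} = \tfrac{k+3}{k+2} \ge 1$$
$$= -\tfrac{1}{2} \|p_+\|_G^2.$$

Here the last inequality uses $\theta_k = \frac{2}{k+3}$ and $\mu_k = \frac{4\lambda_\#}{(k+1)(k+2)}$. This wraps up our unified proof for both settings.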